fix(plan-reviews): restore RECOMMENDATION + Completeness split + Codex ELI10 (v1.6.3.0) by garrytan · Pull Request #1149 · garrytan/gstack

garrytan · 2026-04-22T19:33:01Z

Summary

Two-part fix for AskUserQuestion format regressions in /plan-ceo-review and /plan-eng-review, measured on both Claude Opus 4.7 and Codex (GPT-5.4).

v1.6.2.0 — Claude regression. A user on Opus 4.7 reported that /plan-ceo-review and /plan-eng-review had stopped showing the `RECOMMENDATION: Choose X` line and the per-option `Completeness: N/10` score. Investigation showed the real failure mode: on kind-differentiated questions (mode selection, architectural A-vs-B, cherry-pick Add/Defer/Skip), Opus 4.7 was either fabricating filler scores (10/10 on every option, which conveys nothing) or dropping the format when the metric didn't fit. The fix applies `Completeness: N/10` by question type: coverage-differentiated options get scores, while kind-differentiated options get the note `Note: options differ in kind, not coverage — no completeness score.` instead.
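The coverage-versus-kind split can be pictured as a small rendering rule. The following is a minimal TypeScript sketch, not code from this PR (the actual fix is prompt text in the preamble and skill templates). The names `QuestionKind`, `PlanOption`, `KIND_NOTE`, and `renderOptionLines` are hypothetical, and the note text is shortened to its first clause.

```typescript
// Hypothetical illustration of the scoring rule; not taken from the gstack source.
type QuestionKind = "coverage" | "kind";

interface PlanOption {
  label: string;
  completeness?: number; // 0-10, only meaningful for coverage-differentiated options
}

// Shortened form of the note quoted in the PR description (assumption: first clause only).
const KIND_NOTE = "Note: options differ in kind, not coverage";

function renderOptionLines(
  kind: QuestionKind,
  options: PlanOption[],
  recommended: string,
): string[] {
  // RECOMMENDATION is emitted on every question, whatever its type.
  const lines = [`RECOMMENDATION: Choose ${recommended}`];
  for (const opt of options) {
    if (kind === "coverage" && opt.completeness !== undefined) {
      lines.push(`${opt.label} (Completeness: ${opt.completeness}/10)`);
    } else {
      // Never attach a score when the metric does not apply.
      lines.push(opt.label);
    }
  }
  // Kind-differentiated questions get one explanatory note instead of scores.
  if (kind === "kind") lines.push(KIND_NOTE);
  return lines;
}
```

With this shape, a score that happens to be present on a kind-differentiated option is simply ignored, which mirrors the "do not fabricate scores" intent of the fix.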

v1.6.3.0 — Codex follow-up. A user reported that Codex (GPT-5.4) was failing the same pattern 10/10 times: it skipped the ELI10 explanation and the RECOMMENDATION line on AskUserQuestion calls, forcing a manual "ELI10 and don't forget to recommend" re-prompt every time. Root cause: the gpt.md model overlay's "No preamble / Prefer doing over listing" rule was leading Codex to skip the exact prose the user needs for decision-making. The fix adds an "AskUserQuestion is NOT preamble" carve-out to gpt.md and hardens step 2 of the AskUserQuestion Format rule ("Simplify (ELI10, ALWAYS)" with explicit "not optional verbosity" framing).

Test Coverage

Two new periodic-tier eval files, 4 cases each, pinned to the model family under test:

Claude — test/skill-e2e-plan-format.test.ts (claude-opus-4-7):

| Case | Type | Pre-fix | Post-fix |
| --- | --- | --- | --- |
| plan-ceo-review mode selection | kind | ✗ fabricated `10/10` on all 4 modes | ✓ RECOMMENDATION + "options differ in kind" note |
| plan-ceo-review approach menu | coverage | ✗ regex missed `bolded` | ✓ RECOMMENDATION + `Completeness: 5/7/10` |
| plan-eng-review coverage issue | coverage | ✓ passed | ✓ passes |
| plan-eng-review kind issue | kind | ✗ fabricated `9/9/5` on kind question | ✓ RECOMMENDATION + "options differ in kind" note |

Codex — test/codex-e2e-plan-format.test.ts (codex-cli via codex exec):

| Case | Type | Pre-fix (measured, 10/10 fail) | Post-fix (v1.6.3.0) |
| --- | --- | --- | --- |
| plan-ceo-review mode selection | kind | No ELI10, no RECOMMENDATION | ✓ ELI10 + RECOMMENDATION + "options differ in kind" |
| plan-ceo-review approach menu | coverage | Bare options list | ✓ ELI10 + RECOMMENDATION + `Completeness: 5/7/10` |
| plan-eng-review coverage issue | coverage | Bare options list | ✓ ELI10 + RECOMMENDATION + Completeness |
| plan-eng-review kind issue | kind | Fabricated filler on kind | ✓ ELI10 + RECOMMENDATION + kind note |

Eval pass record

| Pass | Result | Cost | Duration |
| --- | --- | --- | --- |
| Phase 1 baseline — Claude (pre-fix) | 1/4 assertions pass (evidence) | $2.19 | 332s |
| Phase 3 post-fix — Claude | 4/4 pass | $1.84 | 274s |
| Phase 3b regression sweep — `skill-e2e-plan.test.ts` | 12/12 pass, no drift | $5.19 | 1484s |
| Codex eval (v1.6.3.0 fix applied) | 4/4 pass | $0 (Codex billing) | 517s |

Pre-Landing Review

Three plan-phase reviews completed:

CEO Review (HOLD_SCOPE): 4 findings raised, 3 folded into plan, 0 critical gaps.
Eng Review (FULL_REVIEW): 3 issues found, all folded — completeness-section conflict resolved, phantom template anchor corrected, cross-skill regression sweep added.
DX Review (TRIAGE): score 6/10 → 8/10. Critical finding folded in: don't fabricate Completeness: X/10 on kind-differentiated questions.

Plan Completion

All phases shipped:

Phase 1 (baseline eval) — landed, captured regression evidence.
Phase 2 (preamble + template fix) — resolver split, both preamble locations synchronized, 3 template anchors.
Phase 3 (re-run eval) — 4/4 pass on Claude.
Phase 3b (regression sweep) — 12/12 pass on direct neighbor.
Follow-up scope (Codex) — gpt.md carve-out, 4 new Codex eval cases, 4/4 pass.

Phase 4 (literal in-template scaffolding fallback) not needed.

Verification Results

bun test — 448+ passing, 0 failing after golden fixture refresh.
gen-skill-docs --host all — clean across all hosts (claude, codex, factory, gbrain, gpt-5.4, hermes, kiro, opencode, openclaw, slate, cursor).
Claude eval: 4/4 pass on Opus 4.7.
Claude regression sweep: 12/12 pass on skill-e2e-plan.test.ts.
Codex eval: 4/4 pass on GPT-5.4 via codex exec.

Test plan

All free tests pass (bun test — 448+ tests, host-config goldens refreshed)
Phase 1 baseline eval captured Claude regression (3/4 format assertions fail pre-fix)
Phase 3 post-fix Claude eval: 4/4 pass
Phase 3b regression sweep: 12/12 pass (skill-e2e-plan.test.ts, ~$5 spend, no drift)
Codex eval: 4/4 pass (ELI10 + RECOMMENDATION + correct coverage-vs-kind)
All T2 skills regenerated consistently across all hosts
Golden fixtures refreshed (claude-ship, codex-ship, factory-ship)

🤖 Generated with Claude Code

Four-case periodic-tier eval that captures the verbatim AskUserQuestion text /plan-ceo-review and /plan-eng-review produce, then asserts the format rule is honored: RECOMMENDATION always, Completeness: N/10 only on coverage-differentiated options, and an explicit "options differ in kind" note on kind-differentiated options.

Cases:
- plan-ceo-review mode selection (kind-differentiated)
- plan-ceo-review approach menu (coverage-differentiated)
- plan-eng-review per-issue coverage decision
- plan-eng-review per-issue architectural choice (kind-differentiated)

Classified periodic because behavior depends on Opus non-determinism; gate-tier would flake and block merges.

Test harness instructs the agent to write its would-be AskUserQuestion text to $OUT_FILE rather than invoke a real tool (MCP AskUserQuestion isn't wired in the test subprocess). Regex predicates then validate the captured content.

Cost: ~$2 per full run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
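The regex predicates mentioned in this commit message might look roughly like the sketch below. It is an assumption-laden illustration, not the contents of test/skill-e2e-plan-format.test.ts: the pattern names, the `FormatCheck` shape, and `honorsFormatRule` are hypothetical. Because the Test Coverage table notes a pre-fix case where a regex missed bolded text, the patterns here tolerate markdown `**` after the colon; whether the real test does so is not shown in this PR.

```typescript
// Hypothetical predicates; the real patterns in the test file may differ.
interface FormatCheck {
  hasRecommendation: boolean;
  hasCompletenessScore: boolean;
  hasKindNote: boolean;
}

// `\**` tolerates a bold marker between the colon and the value (assumption).
const RECOMMENDATION_RE = /RECOMMENDATION:\**\s*Choose\s+\S+/;
const COMPLETENESS_RE = /Completeness:\**\s*\d{1,2}\/10/;
const KIND_NOTE_RE = /options differ in kind/i;

function checkCapture(text: string): FormatCheck {
  return {
    hasRecommendation: RECOMMENDATION_RE.test(text),
    hasCompletenessScore: COMPLETENESS_RE.test(text),
    hasKindNote: KIND_NOTE_RE.test(text),
  };
}

function honorsFormatRule(text: string, type: "coverage" | "kind"): boolean {
  const c = checkCapture(text);
  // RECOMMENDATION is required on every question.
  if (!c.hasRecommendation) return false;
  // Coverage questions need a score; kind questions need the note and no score.
  return type === "coverage"
    ? c.hasCompletenessScore
    : c.hasKindNote && !c.hasCompletenessScore;
}
```

In a harness like the one described, `text` would be whatever the agent wrote to `$OUT_FILE`, read back after the subprocess exits.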

…stion type

Opus 4.7 users reported /plan-ceo-review and /plan-eng-review stopped emitting the RECOMMENDATION line and per-option Completeness: X/10 scores. E2E capture showed the real failure mode: on kind-differentiated questions (mode selection, architectural A-vs-B, cherry-pick), Opus 4.7 either fabricated filler scores (10/10 on every option, which conveys nothing) or dropped the format entirely when the metric didn't fit.

Fix is at two layers:

1. scripts/resolvers/preamble/generate-ask-user-format.ts splits the old run-on step 3 into:
   - Step 3 "Recommend (ALWAYS)": RECOMMENDATION is required on every question, coverage- or kind-differentiated.
   - Step 4 "Score completeness (when meaningful)": emit Completeness: N/10 only when options differ in coverage. When options differ in kind, skip the score and include a one-line explanatory note. Do not fabricate scores.
2. scripts/resolvers/preamble/generate-completeness-section.ts updates the Completeness Principle tail to match. Without this, the preamble contained two rules (one conditional, one unconditional) and the model hedged.

Template anchors reinforce the distinction where agent judgment is most likely to drift:
- plan-ceo-review Section 0C-bis (approach menu) gets the coverage-differentiated anchor.
- plan-ceo-review Section 0F (mode selection) gets the kind-differentiated anchor.
- plan-eng-review CRITICAL RULE section gets the coverage-vs-kind rule for every per-issue AskUserQuestion raised during the review.

Regenerated SKILL.md for all T2 skills + golden fixtures refreshed. Every skill using the T2 preamble now has the same conditional scoring rule.

Verified via new periodic-tier eval (test/skill-e2e-plan-format.test.ts): all 4 cases fail on prior behavior, all 4 pass with this fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…regressions

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-04-22T19:45:14Z

E2E Evals: ✅ PASS

68/68 tests passed | $8.74 total cost | 12 parallel runners

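| Suite | Result | Status | Cost |
| --- | --- | --- | --- |
| e2e-browse | 7/7 | ✅ | $0.33 |
| e2e-deploy | 6/6 | ✅ | $1.35 |
| e2e-design | 3/3 | ✅ | $0.48 |
| e2e-plan | 8/8 | ✅ | $1.59 |
| e2e-qa-workflow | 3/3 | ✅ | $1.30 |
| e2e-review | 6/6 | ✅ | $1.32 |
| e2e-workflow | 4/4 | ✅ | $0.52 |
| llm-judge | 25/25 | ✅ | $0.50 |
| e2e-deploy | 6/6 | ✅ | $1.35 |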

12x ubicloud-standard-2 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite

Four-case periodic-tier eval mirrors test/skill-e2e-plan-format.test.ts but drives the plan review skills via codex exec instead of claude -p.

Context: Codex under the gpt.md "No preamble / Prefer doing over listing" overlay tends to skip the Simplify/ELI10 paragraph and the RECOMMENDATION line on AskUserQuestion calls. Users have to manually re-prompt "ELI10 and don't forget to recommend" almost every time. This test pins the behavior so regressions surface.

Cases:
- plan-ceo-review mode selection (kind-differentiated)
- plan-ceo-review approach menu (coverage-differentiated)
- plan-eng-review per-issue coverage decision
- plan-eng-review per-issue architectural choice (kind-differentiated)

Assertions on captured AskUserQuestion text:
- RECOMMENDATION: Choose present (all cases)
- Completeness: N/10 present on coverage, absent on kind
- "options differ in kind" note present on kind
- ELI10 length floor (>400 chars), which catches bare options-only output

Cost: ~$2-4 per full run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
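The four assertions listed in this commit message can be combined into one checker that reports every violated rule rather than stopping at the first. This is a hedged sketch only: `collectFailures` and `ELI10_MIN_CHARS` are hypothetical names, and applying the 400-character floor to the whole captured text (rather than to the ELI10 paragraph alone) is an assumption, since the message does not say which is measured.

```typescript
// Hypothetical checker mirroring the assertion list; not the actual test code.
const ELI10_MIN_CHARS = 400; // floor stated in the commit message

function collectFailures(captured: string, type: "coverage" | "kind"): string[] {
  const failures: string[] = [];
  const hasScore = /Completeness:\s*\d{1,2}\/10/.test(captured);
  const hasKindNote = /options differ in kind/i.test(captured);

  if (!/RECOMMENDATION:\s*Choose/.test(captured)) {
    failures.push("missing RECOMMENDATION: Choose");
  }
  if (type === "coverage" && !hasScore) {
    failures.push("missing Completeness: N/10 on coverage question");
  }
  if (type === "kind" && hasScore) {
    failures.push("unexpected Completeness: N/10 on kind question");
  }
  if (type === "kind" && !hasKindNote) {
    failures.push('missing "options differ in kind" note');
  }
  // Assumption: the floor applies to the whole capture, so a bare
  // options list with no explanation falls under it.
  if (captured.trim().length <= ELI10_MIN_CHARS) {
    failures.push("capture too short to contain an ELI10 explanation");
  }
  return failures;
}
```

Returning a list keeps the eval output informative when Codex drops several parts of the format at once, which is the failure pattern described above.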

Follow-up to v1.6.2.0. Codex (GPT-5.4) under the gpt.md overlay treated "No preamble / Prefer doing over listing" as license to skip the Simplify paragraph and the RECOMMENDATION line on AskUserQuestion calls. Users had to manually re-prompt "ELI10 and don't forget to recommend" almost every time.

Two layers:

1. model-overlays/gpt.md adds an explicit "AskUserQuestion is NOT preamble" carve-out. The "No preamble" rule applies to direct answers; AskUserQuestion content must emit the full format (Re-ground, Simplify/ELI10, Recommend, Options). Tells the model: if you find yourself about to skip any of these, back up and emit them; the user will ask anyway, so do it the first time.
2. scripts/resolvers/preamble/generate-ask-user-format.ts: step 2 renamed to "Simplify (ELI10, ALWAYS)" with explicit "not optional verbosity, not preamble" framing. Step 3 "Recommend (ALWAYS)" hardened: "Never omit, never collapse into the options list."

All T2 skills regenerated across all hosts. Golden fixtures refreshed (claude-ship, codex-ship, factory-ship). Updated the ELI10 assertion in test/gen-skill-docs.test.ts to match the new wording.

Codex compliance to be verified empirically via test/codex-e2e-plan-format.test.ts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two test infrastructure bugs in the initial Codex eval landed in the prior commit:

1. sandbox: 'read-only' (the default) blocked Codex from writing $OUT_FILE. Test reported "STATUS: BLOCKED" and exited 0 without a capture file. Fixed: sandbox: 'workspace-write' for all 4 cases, allowing writes inside the tempdir.
2. recordCodexResult called a non-existent evalCollector.record() API (I invented it). The real surface is addTest() with a different field schema. Aligned with test/codex-e2e.test.ts pattern.

With both fixed, the eval now actually measures Codex AskUserQuestion format compliance. All 4 cases pass on v1.6.2.0 with the gpt.md carve-out: RECOMMENDATION always, Completeness: N/10 only on coverage, "options differ in kind" note on kind, ELI10 explanation present.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds the Codex ELI10 + RECOMMENDATION carve-out scope landed after v1.6.2.0's Claude-verified fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

….6.4.0

Main shipped v1.6.3.0 (Codex ELI10 + RECOMMENDATION fix, #1149) and also took the v1.6.2.0 version slot (plan-reviews RECOMMENDATION + Completeness split) while this branch was at 1.6.2.0 without a CHANGELOG entry.

Version-number collision resolved per CLAUDE.md: branch bumps above main's latest, accepts main's two new CHANGELOG entries.

- VERSION: 1.6.4.0 (above main's 1.6.3.0).
- package.json: synced to 1.6.4.0.
- CHANGELOG: main's v1.6.3.0 + v1.6.2.0 entries accepted, placed above our v1.5.2.0 entry in reverse-chronological order.

Auto-merged: many SKILL.md regenerations from main's preamble changes. No real conflicts in security source files.

Security test suite: 87 pass, 0 fail post-merge (security.test.ts + content-security.test.ts).
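The "branch bumps above main's latest" rule can be sketched as a comparison over four-segment version strings. The helpers below are hypothetical and are not part of this repository; the actual procedure lives in CLAUDE.md, which this page does not show. Incrementing the third segment and resetting the fourth is an assumption chosen to match the 1.6.3.0 to 1.6.4.0 step described in this commit message.

```typescript
// Hypothetical helpers for four-segment versions such as "1.6.3.0".
type Version4 = [number, number, number, number];

function parseVersion(s: string): Version4 {
  const trimmed = s.trim();
  if (!/^\d+(\.\d+){3}$/.test(trimmed)) {
    throw new Error(`expected a 4-segment version, got "${s}"`);
  }
  return trimmed.split(".").map(Number) as Version4;
}

// Negative if a < b, zero if equal, positive if a > b (numeric, not lexical).
function compareVersions(a: string, b: string): number {
  const va = parseVersion(a);
  const vb = parseVersion(b);
  for (let i = 0; i < 4; i++) {
    if (va[i] !== vb[i]) return va[i] - vb[i];
  }
  return 0;
}

// If the branch is not already above main, move it just above main's latest.
// Assumption: bump the third segment and reset the fourth.
function bumpAbove(branch: string, main: string): string {
  if (compareVersions(branch, main) > 0) return branch.trim();
  const [major, minor, patch] = parseVersion(main);
  return `${major}.${minor}.${patch + 1}.0`;
}
```

Under these assumptions, `bumpAbove("1.6.2.0", "1.6.3.0")` yields `"1.6.4.0"`, the version this merge settled on.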

garrytan and others added 4 commits April 22, 2026 01:10

Merge remote-tracking branch 'origin/main' into garrytan/plan-review-…

00e8a85

…regressions

chore: bump version and changelog (v1.6.2.0)

d591ad2

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

garrytan and others added 4 commits April 22, 2026 21:34

chore: bump version and changelog (v1.6.3.0)

62fa719

Adds the Codex ELI10 + RECOMMENDATION carve-out scope landed after v1.6.2.0's Claude-verified fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

garrytan changed the title ~~fix(plan-reviews): restore RECOMMENDATION + split Completeness by question type (v1.6.2.0)~~ fix(plan-reviews): restore RECOMMENDATION + Completeness split + Codex ELI10 (v1.6.3.0) Apr 23, 2026

garrytan merged commit 69733e2 into main Apr 23, 2026
20 checks passed

garrytan merged 8 commits into main from garrytan/plan-review-regressions
